N-Gram-Based Techniques for Arabic Text Document Matching; Case Study: Courses Accreditation
Abstract
Measuring text similarity has long been studied because of its importance in many natural language processing applications and in related areas such as Web-based document search. One such application, investigated in this paper, is determining the similarity between course descriptions of the same subject for credit transfer among universities or similar academic programs. Three different bi-gram techniques are used to calculate the similarity between two or more Arabic documents that take the form of course descriptions. The first technique uses the vector model to represent each document so that each bi-gram is associated with a weight reflecting its importance in the document; cosine similarity is then used to compute the similarity between the two vectors. The other two are word-based and whole-document-based evaluation techniques, both of which apply Dice's similarity measure to calculate the similarity between any given pair of documents. The results indicate that the first technique performs better than the other two when compared against human judgment.
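The two similarity measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes character-level bi-grams taken within each whitespace-separated word, raw bi-gram counts as the vector weights (the abstract does not state the weighting scheme), and Dice's measure over sets of unique bi-grams. The function names `char_bigrams`, `dice_similarity`, and `cosine_similarity` are hypothetical.

```python
import math
from collections import Counter


def char_bigrams(text: str) -> list[str]:
    """Return overlapping character bi-grams of each whitespace-separated word.

    Assumption: bi-grams are character-level and do not span word boundaries.
    """
    grams = []
    for word in text.split():
        grams.extend(word[i:i + 2] for i in range(len(word) - 1))
    return grams


def dice_similarity(doc_a: str, doc_b: str) -> float:
    """Dice's coefficient over unique bi-grams: 2|A & B| / (|A| + |B|)."""
    a, b = set(char_bigrams(doc_a)), set(char_bigrams(doc_b))
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine of the angle between bi-gram count vectors.

    Assumption: each bi-gram's weight is its raw count in the document.
    """
    va, vb = Counter(char_bigrams(doc_a)), Counter(char_bigrams(doc_b))
    dot = sum(va[g] * vb[g] for g in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Both functions accept any Unicode strings, so two Arabic course descriptions can be passed directly, for example `dice_similarity("قواعد البيانات", "قاعدة البيانات")`. Both return a score between 0 and 1, where 1 means the two bi-gram profiles are identical.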
Similar resources
Language Independent n-Gram-Based Text Categorization with Weighting Factors: A Case Study
We introduce a new language independent text categorization technique based on byte-level n-gram profiles, an n-gram weighting factors scheme, and a simple algorithm for comparing profiles. The technique does not require any morphological analysis of texts, any preprocessing steps, or any prior information about document content or language. We apply it to the text categorization problem in two...
Improving KNN Arabic Text Classification with N-Grams Based Document Indexing
Text classification is the task of assigning a document to one or more of pre-defined categories based on its contents. This paper presents the results of classifying Arabic language documents by applying the KNN classifier, one time by using N-Gram namely unigrams and bigrams in documents indexing, and another time by using traditional single terms indexing method (bag of words) which supposes...
Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents
Besides for its own merits, text classification (TC) has become a cornerstone in many applications. Work presented here is part of and a pre-requisite for a project we have overtaken to create a corpus for the Arabic text process. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...
Keyphrase Based Evaluation of Automatic Text Summarization
The development of methods to deal with the informative contents of the text units in the matching process is a major challenge in automatic summary evaluation systems that use fixed n-gram matching. The limitation causes inaccurate matching between units in a peer and reference summaries. The present study introduces a new Keyphrase based Summary Evaluator (KpEval) for evaluating automatic sum...
Document Analysis And Classification Based On Passing Window
In this paper we present Document analysis and classification system to segment and classify contents of Arabic document images. This system includes preprocessing, document segmentation, feature extraction and document classification. A document image is enhanced in the preprocessing by removing noise, binarization, and detecting and correcting image skew. In document segmentation, an algorith...
Publication date: 2013